[SPARK-48484][SQL] Fix: V2Write uses the same TaskAttemptId for different task attempts #46811
Conversation
also cc @yaooqinn @HyukjinKwon @cloud-fan @pan3793
yaooqinn left a comment
+1, LGTM.
Can we have a test case?
Please remove the [MINOR] tag and file a Jira ticket for this.

  override def createWriter(partitionId: Int, realTaskId: Long): DataWriter[InternalRow] = {
-   val taskAttemptContext = createTaskAttemptContext(partitionId)
+   val taskAttemptContext = createTaskAttemptContext(partitionId, realTaskId.toInt & Int.MaxValue)
is it the same as math.abs?
Yep, it is the same as Math.abs(realTaskId.toInt).
There are at least two other similar cases here; should we unify them as math.abs? Of course, that should be another PR. @cloud-fan
I don't think perf matters here, and math.abs is definitely more readable.
ok, we can unify the above three cases to math.abs in a follow-up.
@cloud-fan If an overflow occurs, realTaskId.toInt & Int.MaxValue and math.abs are not equivalent:

scala> val realTaskId = Long.MaxValue
val realTaskId: Long = 9223372036854775807
scala> val a = realTaskId.toInt
val a: Int = -1
scala> val b = realTaskId.toInt & Int.MaxValue
val b: Int = 2147483647
scala> val c = math.abs(realTaskId.toInt)
val c: Int = 1

scala> val realTaskId = Int.MaxValue.toLong + 1
val realTaskId: Long = 2147483648
scala> val a = realTaskId.toInt
val a: Int = -2147483648
scala> val b = realTaskId.toInt & Int.MaxValue
val b: Int = 0
scala> val c = math.abs(realTaskId.toInt)
val c: Int = -2147483648

Meanwhile, when an overflow occurs, math.abs may still return a negative value, so I suggest we continue using & Int.MaxValue.
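To make the REPL examples above concrete, here is a small self-contained Scala check (an illustration, not code from the PR) of why the sign-bit mask is the safer choice: math.abs(Int.MinValue) overflows and stays negative, while x & Int.MaxValue always clears the sign bit.

```scala
// Illustration only, not code from the PR: comparing math.abs with the
// Int.MaxValue mask on the two overflow cases discussed above.
object MaskVsAbs {
  def main(args: Array[String]): Unit = {
    // Long.MaxValue truncates to -1 as an Int.
    val fromMaxLong = Long.MaxValue.toInt
    assert(fromMaxLong == -1)
    assert(math.abs(fromMaxLong) == 1)
    assert((fromMaxLong & Int.MaxValue) == Int.MaxValue)

    // Int.MaxValue + 1 truncates to Int.MinValue, where abs itself overflows.
    val fromOverflow = (Int.MaxValue.toLong + 1).toInt
    assert(fromOverflow == Int.MinValue)
    assert(math.abs(fromOverflow) == Int.MinValue) // still negative!
    assert((fromOverflow & Int.MaxValue) == 0)     // mask is never negative
    println("all checks passed")
  }
}
```

Because the mask only clears the sign bit, its result is guaranteed to fall in [0, Int.MaxValue] for every input.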
Sure, I will add a follow-up for SPARK-42478 and a test suite for this PR.

No, SPARK-42478 is not part of the Spark 4.0 cycle; please use a new Jira ticket. @jackylee-ch
create SPARK-48484 for this one @jackylee-ch |
@@ -38,7 +38,7 @@ case class FileWriterFactory (
  @transient private lazy val jobId = SparkHadoopWriterUtils.createJobID(jobTrackerID, 0)

  override def createWriter(partitionId: Int, realTaskId: Long): DataWriter[InternalRow] = {
Not related to this PR, but why not name it taskAttemptId? Can realTaskId be something else?
Can we use PrivateMethodTester in FileWriterFactorySuite to avoid expanding the scope of this function?
If we just check createTaskAttemptContext, do we really need to inherit from SharedSparkSession? Can we just inherit from SparkFunSuite?
We need a Configuration here, as it will be used in createTaskAttemptContext. It's OK with me if we just create a new Configuration.
...re/src/test/scala/org/apache/spark/sql/execution/datasources/v2/FileWriterFactorySuite.scala
…ent task attempts

### What changes were proposed in this pull request?
After #40064 , we always get the same TaskAttemptId for different task attempts which have the same partitionId. This would lead different task attempts to write to the same directory.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46811 from jackylee-ch/fix_v2write_use_same_directories_for_different_task_attempts.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Kent Yao <yao@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 67d11b1)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
Merged into master/3.5/3.4. Thanks @jackylee-ch @yaooqinn @cloud-fan @ulysses-you @yikf
What changes were proposed in this pull request?
After #40064 , we always get the same TaskAttemptId for different task attempts which have the same partitionId. This would lead different task attempts to write to the same directory.
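The collision can be sketched in a few lines of plain Scala (illustrative names, not the actual FileWriterFactory internals): when the attempt context is derived from partitionId alone, a retried attempt of the same partition maps to the same key, and therefore to the same output directory.

```scala
// Illustrative sketch, not the actual Spark code: how deriving the attempt
// id from the globally unique task id keeps retried attempts apart.
final case class AttemptKey(partitionId: Int, attemptId: Int)

object TaskAttemptSketch {
  // Before the fix: every attempt of a partition produced the same key.
  def keyBefore(partitionId: Int): AttemptKey = AttemptKey(partitionId, 0)

  // After the fix: the unique Spark task id (masked to stay nonnegative)
  // distinguishes attempts of the same partition.
  def keyAfter(partitionId: Int, realTaskId: Long): AttemptKey =
    AttemptKey(partitionId, realTaskId.toInt & Int.MaxValue)

  def main(args: Array[String]): Unit = {
    // Two attempts of partition 0, with distinct task ids 7 and 8.
    assert(keyBefore(0) == keyBefore(0))        // old behavior: collision
    assert(keyAfter(0, 7L) != keyAfter(0, 8L))  // fixed: distinct keys
  }
}
```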
Does this PR introduce any user-facing change?
No.
How was this patch tested?
GA
Was this patch authored or co-authored using generative AI tooling?
No.